Predicting Wine Quality from Physicochemical Properties by Vincent Shields

The datasets used in this project were first used in a study by [Cortez et al., 2009] and were later made publicly available. The two datasets are related to the red and white variants of the Portuguese “Vinho Verde” wine. The authors of the study request that their work be cited: P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236. A more comprehensive works cited will be provided as a text file with this submission.

The chemical data were obtained through objective tests, and the output variable (quality) was based on sensory data recorded by wine-tasting professionals (median of at least 3 evaluations). Quality ranges from 0 (very bad) to 10 (excellent), but as you will see, no wines were rated below 3 or above 9. I will provide the text file that the original authors of the study supply to describe each variable specifically. Let’s dive deeper into the data to get a better sense of what is going on.

Univariate Plots Section

## 'data.frame':    6497 obs. of  13 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##  $ quality_num         : num  5 5 5 6 5 5 5 7 7 5 ...
##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar  
##  Min.   : 3.800   Min.   :0.0800   Min.   :0.0000   Min.   : 0.600  
##  1st Qu.: 6.400   1st Qu.:0.2300   1st Qu.:0.2500   1st Qu.: 1.800  
##  Median : 7.000   Median :0.2900   Median :0.3100   Median : 3.000  
##  Mean   : 7.215   Mean   :0.3397   Mean   :0.3186   Mean   : 5.443  
##  3rd Qu.: 7.700   3rd Qu.:0.4000   3rd Qu.:0.3900   3rd Qu.: 8.100  
##  Max.   :15.900   Max.   :1.5800   Max.   :1.6600   Max.   :65.800  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.00900   Min.   :  1.00      Min.   :  6.0       
##  1st Qu.:0.03800   1st Qu.: 17.00      1st Qu.: 77.0       
##  Median :0.04700   Median : 29.00      Median :118.0       
##  Mean   :0.05603   Mean   : 30.53      Mean   :115.7       
##  3rd Qu.:0.06500   3rd Qu.: 41.00      3rd Qu.:156.0       
##  Max.   :0.61100   Max.   :289.00      Max.   :440.0       
##     density             pH          sulphates         alcohol     
##  Min.   :0.9871   Min.   :2.720   Min.   :0.2200   Min.   : 8.00  
##  1st Qu.:0.9923   1st Qu.:3.110   1st Qu.:0.4300   1st Qu.: 9.50  
##  Median :0.9949   Median :3.210   Median :0.5100   Median :10.30  
##  Mean   :0.9947   Mean   :3.219   Mean   :0.5313   Mean   :10.49  
##  3rd Qu.:0.9970   3rd Qu.:3.320   3rd Qu.:0.6000   3rd Qu.:11.30  
##  Max.   :1.0390   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##     quality       quality_num   
##  Min.   :3.000   Min.   :3.000  
##  1st Qu.:5.000   1st Qu.:5.000  
##  Median :6.000   Median :6.000  
##  Mean   :5.818   Mean   :5.818  
##  3rd Qu.:6.000   3rd Qu.:6.000  
##  Max.   :9.000   Max.   :9.000

The combined red wine and white wine dataset consists of 6497 observations and 12 original variables. The red wine dataset consists of 1599 observations and the white wine dataset contains 4898 observations. The quality_num variable was created by converting the variable quality to a numeric type for the sake of certain visualizations. A new factor variable was also created for the white wine dataset, based on sweetness parameters from the Riesling sugar guidelines. This will be explained in more detail later. Now let’s look at some variable distributions.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.818   6.000   9.000
## 
##    3    4    5    6    7    8    9 
##   30  216 2138 2836 1079  193    5

In the variable descriptions text, the authors state that, “The classes are ordered and not balanced (e.g. there are munch [sic] more normal wines than excellent or poor ones).” Looking at the distribution of quality for white and red wine, this appears to be true.
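To put a number on that imbalance, the short sketch below recomputes shares from the frequency table above. The counts are copied from that table; the grouping into "low" (3 to 4), "normal" (5 to 6) and "high" (8 to 9) is my own illustrative choice, not a definition used by the original authors.

```r
# Quality counts copied from the frequency table above.
quality_counts <- c("3" = 30, "4" = 216, "5" = 2138, "6" = 2836,
                    "7" = 1079, "8" = 193, "9" = 5)

total  <- sum(quality_counts)      # total number of wines
shares <- quality_counts / total   # share of wines at each rating

# Illustrative grouping (an assumption, not from the dataset description).
share_low    <- sum(shares[c("3", "4")])
share_normal <- sum(shares[c("5", "6")])
share_high   <- sum(shares[c("8", "9")])

# Percentages, rounded to one decimal place.
round(100 * c(low = share_low, normal = share_normal, high = share_high), 1)
```

Under this grouping, roughly three quarters of the wines are rated 5 or 6, while each tail holds only a few percent.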

## [1] "Red Wine Volatile Acidity"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800
## [1] "White Wine Volatile Acidity"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0800  0.2100  0.2600  0.2782  0.3200  1.1000

The distribution of volatile acidity appears bimodal for red wine and positively skewed for white wine. Red wine contains more volatile acidity on average than white wine (mean 0.53 vs. 0.28 g/dm^3). A boxplot could give us a better sense of the spread and of the outliers.

The distribution for red wine appears to have more variance than the distribution for white wine. Many data points overlap inside the interquartile range for white wine, so the alpha parameter is smaller for the white wine distribution.
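As a rough check on where the boxplot outliers begin, the sketch below applies the usual 1.5 x IQR whisker rule to the quartiles printed in the volatile acidity summaries above. It assumes the boxplots use that default rule; the data frame and its column names are made up for illustration.

```r
# Quartiles and maxima copied from the volatile acidity summaries (g/dm^3).
quartiles <- data.frame(
  wine = c("red", "white"),
  q1   = c(0.39, 0.21),
  q3   = c(0.64, 0.32),
  max  = c(1.58, 1.10)
)

# Tukey's rule: points above Q3 + 1.5 * IQR are drawn as outliers.
quartiles$iqr            <- quartiles$q3 - quartiles$q1
quartiles$upper_fence    <- quartiles$q3 + 1.5 * quartiles$iqr
quartiles$max_is_outlier <- quartiles$max > quartiles$upper_fence

quartiles
```

The interquartile range for red wine is more than twice that of white wine, and in both cases the maximum lies well above the upper fence.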

The distribution of percent alcohol by volume appears to be positively skewed for red wine and almost bimodal for white wine, but not quite. The median % alcohol is 10.20 for red wine and 10.40 for white wine. It is interesting that red and white wines have similar median values for alcohol, but quite a large difference for volatile acidity.

## [1] "Red Wine Alcohol by Volume"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90
## [1] "White Wine Alcohol by Volume"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.50   10.40   10.51   11.40   14.20

## [1] "Red Wine Total Sulfur Dioxide"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00
## [1] "White Wine Total Sulfur Dioxide"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     9.0   108.0   134.0   138.4   167.0   440.0

White wines appear to have a normal distribution of total sulfur dioxide. For red wine, the distribution of total sulfur dioxide is positively skewed. On average, white wines contain more than double the total sulfur dioxide of red wines. Before looking at acidity and sulfur dioxide in more detail, I wonder what the distribution looks like for the categorical variable I created to represent the Riesling sugar guidelines.

This plot displays the count of the different levels of the variable “Scale”. This variable represents the Riesling Sugar Guidelines, set by the International Riesling Foundation. Riesling is a white grape variety commonly used in white wines, so this variable was only created in the white wine dataset. The Riesling grape variety is said to have high levels of acidity; however, it appears that red wine has more volatile acidity on average. I wonder if there are other variables present in white wine that affect volatile acidity, or if purple grapes are simply more acidic in general. The Riesling sugar guidelines are based on the amount of acidity, residual sugar, and pH levels in the wine. Here is the chart for reference: [image: Riesling sugar guidelines chart]

## [1] "Reisling scale count"
## 
##          Dry   Medium Dry Medium Sweet        Sweet 
##         2301         1923          620           54

The largest group of white wines (2,301 of 4,898, about 47%) falls under the “Dry” category on the Riesling scale. It will be interesting to see if there is any relationship between these levels and the output variable quality.
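For readers who want to see how such a scale variable could be built, here is a minimal sketch. It is not the function used in this report: the sugar-to-acid cut-offs (1, 2, 4), the pH shift points (2.9 and 3.3), and the use of fixed.acidity as the acid measure are all assumptions chosen for illustration, and the function name riesling_scale is hypothetical.

```r
# ASSUMPTION: cut-offs, pH shift points and the acid measure are illustrative
# only; check them against the official guidelines before relying on them.
riesling_scale <- function(residual.sugar, fixed.acidity, pH) {
  levels_out <- c("Dry", "Medium Dry", "Medium Sweet", "Sweet")

  # Base category from the sugar-to-acid ratio (integer codes 1 to 4).
  ratio <- residual.sugar / fixed.acidity
  base  <- cut(ratio, breaks = c(-Inf, 1, 2, 4, Inf), labels = FALSE)

  # High pH shifts one step sweeter, low pH one step drier.
  shift <- ifelse(pH >= 3.3, 1L, ifelse(pH <= 2.9, -1L, 0L))

  # Keep the result inside the four available categories.
  level <- pmin(pmax(base + shift, 1L), length(levels_out))
  factor(levels_out[level], levels = levels_out, ordered = TRUE)
}

# Made-up example rows, not taken from the dataset.
toy <- data.frame(residual.sugar = c(1.6, 8.1, 20.7, 1.2),
                  fixed.acidity  = c(7.0, 6.3, 7.0, 7.2),
                  pH             = c(3.0, 3.2, 3.0, 3.4))
toy$scale <- riesling_scale(toy$residual.sugar, toy$fixed.acidity, toy$pH)
table(toy$scale)
```

Returning an ordered factor keeps the levels in dry-to-sweet order in tables and plots, which matches the ordering of the count table above.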


Other Variables

The distribution for citric acid appears somewhat bimodal for red wine and normal for white wine.

Both distributions of fixed acidity appear slightly positively skewed, but almost normal.

Both distributions for residual sugar appear positively skewed.

Again we see two distributions that appear positively skewed, but the distribution of chlorides for red wine looks like it could almost be considered normal.

The distributions of free sulfur dioxide for red and white wine appear positively skewed. I wonder why so many of the distributions appear positively skewed.

Finally we see some distributions that look a little different. The distribution of density (measured in g/cm^3) appears normal for both red and white wine. The distribution for white wine, however, looks clustered towards the left.

The distributions of pH for both red and white wine appear normal.

The distributions of potassium sulphate for both red and white wine appear positively skewed.

Univariate Analysis

What is the structure of your dataset?

The datasets used contain 1599 observations of red wines, 4898 observations of white wines and 6497 observations total. The output variable, quality, is an integer representing the quality of the wines observed from 0-10. However, there are no wines rated at a quality of less than 3 and no wines rated at a quality of greater than 9. For white wines, most wines observed are rated at a quality of 6. For red wines, most wines observed are rated at a quality of 5. However, we must keep in mind that the difference in sample size could play a large role in this difference.

Other observations:

  • The median % alcohol by volume is 10.20 for red wines and 10.40 for white wines
  • The median amount of volatile acidity is 0.52 g/dm^3 in red wines and 0.26 g/dm^3 in white wines
  • The largest group of white wines (about 47%) is considered ‘Dry’ according to the Riesling sugar guidelines
  • In total, about 4% of wines (246 of 6,497) have a quality rating less than 5, and about 37% have a rating of 5 or lower
  • The median amount of total sulfur dioxide is 38.00 mg/dm^3 for reds and 134.0 mg/dm^3 for whites

What is/are the main feature(s) of interest in your dataset?

I would like to see which physicochemical properties of wine impact the quality ratings of the wines sampled. I will try to predict quality using an OLS regression model. In the case study for this course, the instructor used the diamonds dataset to predict the price of diamonds, so I thought it would be interesting to do something similar here. I do not intend to show any causal relationships between quality and the physicochemical properties measured in the wine samples. The main regressor of interest is volatile acidity, because in the variables description provided with the dataset, the authors state that volatile acidity is “the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste”. Based on this information, I thought that volatile acidity would play the largest role in the assessment of quality, if there is any relationship between quality and the physicochemical properties at all.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

Residual sugar, chlorides, total sulfur dioxide, and pH most likely impact the ratings of quality as well. All these variables will be explored later in the analysis.

Did you create any new variables from existing variables in the dataset?

Yes. A scale of dryness to sweetness based on the Riesling sugar guidelines was added to the white wine dataset. Also, new variables were created to convert the variable ‘quality’ to both a numeric datatype and a factor datatype with 7 levels, for the sake of certain visualizations. For the linear model, the original (integer) variable ‘quality’ is used as the output variable.

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

There was a variable ‘X’ which represented the row count of the dataset. This column was dropped because the R dataframe structure takes care of that for us. I also performed a vertical join on the two wine datasets, to create a dataset that accounts for every observation of wine. This was done to find general information about the data that is not specific to the color of the wines.
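The tidying steps described above can be sketched as follows. The two small data frames stand in for the red and white CSV files, with made-up values; the helper name prepare and the column name quality_factor are hypothetical, and the real code in this report may differ.

```r
# Tiny stand-ins for the two raw datasets; values are invented for illustration.
red_raw   <- data.frame(X = 1:3, alcohol = c(9.4, 9.8, 10.0),
                        quality = c(5L, 5L, 6L))
white_raw <- data.frame(X = 1:2, alcohol = c(8.8, 9.5),
                        quality = c(6L, 7L))

prepare <- function(df) {
  df$X <- NULL                                      # drop the redundant row counter
  df$quality_num    <- as.numeric(df$quality)       # numeric copy for plots
  df$quality_factor <- factor(df$quality, levels = 3:9)  # 7-level factor copy
  df
}

red   <- prepare(red_raw)
white <- prepare(white_raw)

# "Vertical join": stack the rows of both datasets into one data frame.
all_wines <- rbind(red, white)
str(all_wines)
```

Fixing the factor levels to 3 through 9 before stacking keeps the levels identical in both pieces, so rbind does not have to reconcile them.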


Bivariate Plots Section

The variables description states that sulphates are “a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant”, so I was interested to look at the relationship between sulphates and free sulfur dioxide (“free” refers to sulfur dioxide in gas form). However, it appears that the relationship is slightly negative, which is not what I would have expected. This was a disappointing find. In the plot above, the alpha parameter is set to 1/30. Sulfur dioxide is said to be undetectable in wine (with the exception of levels over 50 ppm), so I am curious whether or not there is a relationship between sulfur dioxide and quality.

This scatter plot represents the relationship between quality and total sulfur dioxide. As you can see, there is a lot of variance in the data, so it is hard to see a relationship visually. Perhaps it would be more useful to assess the relationship numerically. We will see the relationship between quality and total sulfur dioxide in the first regression table, but first let’s look at some other plots.

Boxplots offer a better visual representation of the relationship between quality and sulfur dioxide. Among red wines, those rated 5 have the most total sulfur dioxide on average, and among white wines, those rated 3 have the most. The interquartile range is so small for wines rated 9 because there are only 5 such wines in the dataset.

## [1] "White Wine Total Sulfur Dioxide"
## winew$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    19.0   105.8   159.5   170.6   210.0   440.0 
## -------------------------------------------------------- 
## winew$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    10.0    85.0   117.0   125.3   171.5   272.0 
## -------------------------------------------------------- 
## winew$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     9.0   121.0   151.0   150.9   182.0   344.0 
## -------------------------------------------------------- 
## winew$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    18.0   107.2   132.0   137.0   164.0   294.0 
## -------------------------------------------------------- 
## winew$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    34.0   101.0   122.0   125.1   144.2   229.0 
## -------------------------------------------------------- 
## winew$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    59.0   102.5   122.0   126.2   150.0   212.5 
## -------------------------------------------------------- 
## winew$quality: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      85     113     119     116     124     139
## [1] "Red Wine Total Sulfur Dioxide"
## wine$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     9.0    12.5    15.0    24.9    42.5    49.0 
## -------------------------------------------------------- 
## wine$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    7.00   14.00   26.00   36.25   49.00  119.00 
## -------------------------------------------------------- 
## wine$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   26.00   47.00   56.51   84.00  155.00 
## -------------------------------------------------------- 
## wine$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   23.00   35.00   40.87   54.00  165.00 
## -------------------------------------------------------- 
## wine$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    7.00   17.50   27.00   35.02   43.00  289.00 
## -------------------------------------------------------- 
## wine$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12.00   16.00   21.50   33.44   43.00   88.00

The tables above offer a numerical representation of the relationship between total sulfur dioxide and quality. As we can see again, white wines have much higher levels of total sulfur dioxide. Could this be because the process of fermenting white grapes is more prone to bacteria? It could also be related to the difference in the levels of volatile acidity. Before we explore volatile acidity more, let’s look at a coefficient from a linear model.
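Group-wise summaries in the style of the tables above can be produced with base R's by(). The sketch below uses a small invented data frame rather than the wine data, so the numbers are purely illustrative.

```r
# Invented toy data: three quality levels, four wines each.
toy <- data.frame(
  quality = rep(c(5L, 6L, 7L), each = 4),
  total.sulfur.dioxide = c(100, 120, 150, 170,   # quality 5
                            90, 110, 130, 140,   # quality 6
                            80, 100, 120, 130)   # quality 7
)

# Six-number summary of total sulfur dioxide within each quality level.
by(toy$total.sulfur.dioxide, toy$quality, summary)

# The same comparison reduced to a single statistic per group.
medians <- tapply(toy$total.sulfur.dioxide, toy$quality, median)
medians
```

tapply() is handy when only one statistic per group is needed, for example to compare medians across quality levels at a glance.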

## 
## Table 1.0
## =========================================================================
##                                            Dependent variable:           
##                                 -----------------------------------------
##                                                  quality                 
##                                 sulfur dioxide in mg/ sulfur dioxide in g
##                                          (1)                  (2)        
## -------------------------------------------------------------------------
## total.sulfur.dioxide                  -0.001***                          
##                                       (0.0002)                           
##                                                                          
## I(total.sulfur.dioxide/1000)                               -0.639***     
##                                                             (0.194)      
##                                                                          
## Constant                              5.892***             5.892***      
##                                        (0.025)              (0.025)      
##                                                                          
## -------------------------------------------------------------------------
## Observations                            6,497                6,497       
## R2                                      0.002                0.002       
## Adjusted R2                             0.002                0.002       
## Residual Std. Error (df = 6495)         0.873                0.873       
## F Statistic (df = 1; 6495)            11.143***            11.143***     
## =========================================================================
## Note:                                         *p<0.1; **p<0.05; ***p<0.01

Without holding any other variables constant, a 1 g/dm^3 increase in total sulfur dioxide is associated with a 0.639 decrease in quality rating. However, sulfur dioxide was converted to grams only so that its coefficient could be compared more directly with the coefficient on volatile acidity (which is measured in g/dm^3); it is unlikely that a whole gram of sulfur dioxide would be present in a wine sample. In fact, the maximum amount of total sulfur dioxide in any sample is 440 mg/dm^3. In the original units, a 1 mg/dm^3 increase in total sulfur dioxide is associated with a decrease in quality rating of about 0.0006 (shown as 0.001 in the table because of rounding). This may support the idea that sulfur dioxide is more or less undetectable in wine. However, without holding other variables constant, we cannot rule out bias from omitting relevant variables, so any conclusions would be premature. Looking at the effect of a one-unit change on quality may not be the best way to investigate the relationships, so let’s look at a few more visualizations to get a better sense of what’s going on.

This is another scatter plot with quality on the x-axis, this time against volatile acidity. Again, this is not the best visual representation of the relationship. It would likely be more useful to look at some box plots, so let’s do that.

For red wine, you can see a consistent decrease in the median volatile acidity as quality increases, with the exception of the increase from 7 to 8. For white wines, the median levels of volatile acidity bounce around much more than for red wines. It could be that there is a relationship between volatile acidity and variables that are more prevalent in white wine, such as residual sugar. Before we explore other variables, let’s look at another regression table.
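The two columns of Table 1.0 describe the same fit in different units: dividing a regressor by 1000 multiplies its slope by 1000 and leaves the intercept and R2 unchanged. The sketch below demonstrates this on simulated data; the variable names and the simulated relationship are assumptions made only for the demonstration.

```r
set.seed(42)

# Simulated stand-in data (not the wine dataset).
n       <- 200
so2_mg  <- runif(n, 6, 440)                          # mg/dm^3
quality <- 6 - 0.0006 * so2_mg + rnorm(n, sd = 0.8)

fit_mg <- lm(quality ~ so2_mg)             # regressor in mg/dm^3
fit_g  <- lm(quality ~ I(so2_mg / 1000))   # same regressor in g/dm^3

slope_mg <- unname(coef(fit_mg)[2])
slope_g  <- unname(coef(fit_g)[2])

# The slope in grams is 1000 times the slope in milligrams.
c(slope_mg = slope_mg, slope_g = slope_g, ratio = slope_g / slope_mg)
```

This is why the table shows -0.001 in one column and -0.639 in the other: both are the same estimate, about -0.000639 per mg/dm^3, rounded to three decimal places.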

## 
## Table 1.1
## =================================================================================================
##                                                  Dependent variable:                             
##                     -----------------------------------------------------------------------------
##                                                        quality                                   
##                             all wines                white wine                 red wine         
##                                (1)                       (2)                       (3)           
## -------------------------------------------------------------------------------------------------
## volatile.acidity            -1.409***                 -1.711***                 -1.761***        
##                              (0.060)                   (0.133)                   (0.110)         
##                                                                                                  
## Constant                    6.297***                  6.354***                  6.566***         
##                              (0.023)                   (0.037)                   (0.062)         
##                                                                                                  
## -------------------------------------------------------------------------------------------------
## Observations                  6,497                     4,898                     1,599          
## R2                            0.071                     0.038                     0.153          
## Adjusted R2                   0.070                     0.038                     0.152          
## Residual Std. Error     0.842 (df = 6495)         0.869 (df = 4896)         0.744 (df = 1597)    
## F Statistic         493.351*** (df = 1; 6495) 192.958*** (df = 1; 4896) 287.444*** (df = 1; 1597)
## =================================================================================================
## Note:                                                                 *p<0.1; **p<0.05; ***p<0.01

A one g/dm^3 increase in volatile acidity is associated with a 1.711 decrease in quality rating for white wines, without holding any other variables constant. For red wines, a one g/dm^3 increase in volatile acidity is associated with a 1.761 decrease in quality rating, again without holding any other variables constant. To compare the impact of sulfur dioxide to the impact of volatile acidity, I converted sulfur dioxide to g/dm^3. Sulfur dioxide had a coefficient of -0.639, so the impact of volatile acidity appears greater, not to mention the fact that a whole gram of sulfur dioxide is unlikely to be present in any sample. Another method would have been to standardize the coefficients and compare them, but I figured that since it is such a simple conversion, this comparison would be sufficient to gain some insight.

As I mentioned earlier, there could be a relationship between volatile acidity and residual sugar. Let’s look at some plots to investigate this question. But first, I realized later in my multivariate analysis that this coefficient may be very misleading. It is very unlikely that the volatile acidity would increase by one gram, as the maximum amount of volatile acidity recorded is 1.58 g/dm^3 and the standard deviation is 0.16. To compare the two variables, I think it is best to standardize the coefficients.

## 
## Table 1.2
## ========================================================================
##                                      Dependent variable:                
##                      ---------------------------------------------------
##                                            quality                      
##                             white wine                 red wine         
##                              red wine                 white wine        
## ------------------------------------------------------------------------
## volatile.acidity             -1.708***                 -1.587***        
##                               (0.113)                   (0.135)         
##                                                                         
## total.sulfur.dioxide         -0.004***                 -0.003***        
##                               (0.001)                  (0.0003)         
##                                                                         
## Constant                     6.715***                  6.777***         
##                               (0.065)                   (0.051)         
##                                                                         
## ------------------------------------------------------------------------
## Observations                   1,599                     4,898          
## R2                             0.177                     0.063          
## Adjusted R2                    0.176                     0.062          
## Residual Std. Error      0.733 (df = 1596)         0.858 (df = 4895)    
## F Statistic          171.358*** (df = 2; 1596) 164.217*** (df = 2; 4895)
## ========================================================================
## Note:                                        *p<0.1; **p<0.05; ***p<0.01
## [1] "Red wine"
##     volatile.acidity total.sulfur.dioxide 
##           -0.3786172           -0.1561474
## [1] "White wine"
##     volatile.acidity total.sulfur.dioxide 
##           -0.1805645           -0.1586199

A one standard deviation increase in volatile acidity is associated with a 0.38 standard deviation decrease in quality rating for red wine and a 0.18 standard deviation decrease for white wine. A one standard deviation increase in total sulfur dioxide is associated with a 0.156 standard deviation decrease in quality rating for red wine and a 0.159 standard deviation decrease for white wine. Now let’s look at some plots to get a sense of the relationship I was curious about earlier.
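Standardized coefficients of this kind are commonly computed as the raw slope times sd(x) / sd(y), which is equivalent to refitting the model on z-scored variables. The sketch below checks that equivalence on simulated data; it is an illustration under that assumption, not the code used to produce the numbers above.

```r
set.seed(7)

# Simulated stand-in data (not the wine dataset).
n <- 300
volatile.acidity     <- runif(n, 0.1, 1.2)
total.sulfur.dioxide <- runif(n, 10, 250)
quality <- 6.5 - 1.6 * volatile.acidity - 0.003 * total.sulfur.dioxide +
  rnorm(n, sd = 0.7)
toy <- data.frame(quality, volatile.acidity, total.sulfur.dioxide)

fit <- lm(quality ~ volatile.acidity + total.sulfur.dioxide, data = toy)

# Standardized slope = raw slope * sd(regressor) / sd(response).
b        <- coef(fit)[-1]                 # drop the intercept
sd_x     <- sapply(toy[names(b)], sd)
beta_std <- b * sd_x / sd(toy$quality)

# Cross-check: refit on z-scored variables and compare the slopes.
fit_z <- lm(quality ~ volatile.acidity + total.sulfur.dioxide,
            data = as.data.frame(scale(toy)))
rbind(formula = beta_std, z_scored_fit = coef(fit_z)[-1])
```

Because both the regressors and the response are rescaled, each standardized slope reads as a change in quality, measured in standard deviations, per one standard deviation change in the regressor.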


There is a lot of noise in the plots above, so the alpha parameter was set to 1/20. This makes the relationship somewhat easier to see, but there is still a lot of variance in the data. Speaking of sugar, I wonder how the Riesling sugar guidelines relate to quality ratings.

## [1] "Quality by reisling scale"
## winew$scale: Dry
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.884   6.000   9.000 
## -------------------------------------------------------- 
## winew$scale: Medium Dry
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.914   6.000   9.000 
## -------------------------------------------------------- 
## winew$scale: Medium Sweet
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.776   6.000   8.000 
## -------------------------------------------------------- 
## winew$scale: Sweet
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   5.000   5.000   5.000   5.519   6.000   7.000

The median quality rating for dry, medium dry, and medium sweet is 6. The median quality rating for sweet is 5. I know that when fermenting wine, sugar is converted to alcohol (put simply). So I wonder how residual sugar relates to alcohol content. Obviously, the residual sugar has nothing to do with the sugar converted into alcohol, but there could be a relationship between residual sugar and total sugar present before the fermentation process.

Assuming that more residual sugar present in the wine sample means less sugar converted to alcohol, this relationship makes sense. Let’s see if the relationship holds when looking at the alcohol content across the levels of the Riesling sugar guidelines scale.

## [1] "percent alcohol by reisling scale"
## winew$scale: Dry
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.90   10.80   10.89   11.80   14.20 
## -------------------------------------------------------- 
## winew$scale: Medium Dry
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.30   10.00   10.28   11.00   14.05 
## -------------------------------------------------------- 
## winew$scale: Medium Sweet
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.10    9.70    9.91   10.50   14.00 
## -------------------------------------------------------- 
## winew$scale: Sweet
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.600   9.000   9.550   9.824  10.200  12.400

Wines rated as ‘dry’ according to the Riesling scale have the highest median alcohol content. At first I thought the opposite would be true, but then I thought about it a little more. Perhaps the wines rated as ‘dry’ have the most sugar converted to alcohol during the fermentation process. However, the Riesling scale also relies on acid ratios and pH, so it’s hard to tell.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

The bivariate analysis revealed some interesting trends. The relationship between sulphates and sulfur dioxide was much weaker than I expected. The variables description led me to believe that the relationship would be strong. Additionally, the impact of volatile acidity on quality appears greater than the impact of total sulfur dioxide on quality, especially for red wine. As I mentioned earlier, we should add some other variables to the linear model in order to get a clearer insight into what is actually going on. Lastly, the median quality rating does not vary greatly between the different Riesling sugar guideline categories.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

Yes. An interesting find for me was the variation in the median alcohol content across the different Riesling sugar guideline categories. I found it interesting that wines marked as “Dry” had the highest median alcohol content, 0.8 percentage points above the next-highest category (“Medium Dry”). The median alcohol content of wines marked as “Dry” is greater than the median alcohol content of wines marked as “Sweet” by 1.25 percentage points.

What was the strongest relationship you found?

So far, the strongest relationship between quality and the physicochemical properties has been the relationship between volatile acidity and quality. I initially thought that the impact of volatile acidity would be much greater than that of total sulfur dioxide, and a first glance at the coefficients certainly makes it seem so. For red wine, when the coefficients were standardized, volatile acidity had over twice the impact of total sulfur dioxide. For white wine, however, the difference was much less extreme. I’m sure there are relationships between some of the chemical properties that are very strong, such as the relationship between free sulfur dioxide and total sulfur dioxide, or citric acid and volatile acidity. However, including each of these in a linear model could result in issues with multicollinearity. It would be more useful to look at relationships between chemical properties that are not so obvious (volatile acidity against residual sugar) and relationships between chemical properties and quality.
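One common way to gauge multicollinearity before adding related regressors is the variance inflation factor, VIF = 1 / (1 - R2), where R2 comes from regressing one predictor on the others. The sketch below computes it by hand on simulated data in which total sulfur dioxide is built to contain free sulfur dioxide; the helper name vif_by_hand and the simulated values are assumptions for illustration only.

```r
set.seed(3)

# Simulated predictors (not the wine dataset). Total SO2 includes free SO2,
# so the two are correlated by construction; alcohol is independent.
n <- 500
free.sulfur.dioxide  <- runif(n, 1, 80)
total.sulfur.dioxide <- free.sulfur.dioxide + runif(n, 5, 60)
alcohol              <- runif(n, 8, 14)
toy <- data.frame(free.sulfur.dioxide, total.sulfur.dioxide, alcohol)

# VIF for one predictor: regress it on all the others, then 1 / (1 - R^2).
vif_by_hand <- function(var, data) {
  others <- setdiff(names(data), var)
  r2 <- summary(lm(reformulate(others, response = var), data = data))$r.squared
  1 / (1 - r2)
}

vifs <- sapply(names(toy), vif_by_hand, data = toy)
round(vifs, 2)
```

A VIF near 1 means a predictor is nearly unrelated to the others, while larger values indicate that its coefficient's standard error is being inflated by overlap with other regressors.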

Multivariate Plots Section

For quality ratings of 4,5, and 6 wines rated as ‘Dry’ seem to overlap the grand mean. Wines rated ‘sweet’ and with a quality of 6 have more volatile acidity than average. Perhaps the level of sweetnes tends to mask the taste of volatile acidity. This visual is interesting in that there is a consistent decrease in volatile acidity as quality increases, then seems to spike back up at the end. However, for wines rated at a quality of 7, ‘sweet’ wines appear to have very low levels of volatile acidity.

The grand mean follows a peculiar pattern, spiking from wines rated with a quality of 3 to wines rated 4, then decreasing from 4 to 6, and steadily increasing from 7 to 9. I expected to see a more steady decrease as quality increases.

This plot represents the same variables as the line plot, but the data are displayed in box plots faceted by Riesling scale. This gives us some useful information, although the line plots are better for a general sense of the data. We can see that there are no wines rated ‘sweet’ with a quality of 9. Wines rated ‘Dry’ and ‘medium dry’ tend to have the highest levels of volatile acidity for each quality category, with the exception of 6.

The same plot as above but with added jitter and an alpha parameter of 1/10.

Here is another visual exploring the relationship between residual sugar and volatile acidity, this time throwing the Riesling sugar guidelines into the mix. As we would expect, you can see that the Riesling scale gets sweeter as residual sugar increases. This is simply because residual sugar is built into the function that generated the Riesling scale variable. There are some wines rated medium sweet that have lower amounts of residual sugar, and that is likely caused by the shift due to pH. Now that we have seen some interesting visuals, let's look at some numerical relationships with a multivariate regression.


## 
## Table 1.3 (Red Wine)
## ==================================================================================================
##                                                   Dependent variable:                             
##                      -----------------------------------------------------------------------------
##                                                         quality                                   
##                                Reg 1                     Reg 2                     Reg 3          
##                                 (1)                       (2)                       (3)           
## --------------------------------------------------------------------------------------------------
## volatile.acidity             -1.708***                 -1.371***                 -1.245***        
##                               (0.113)                   (0.112)                   (0.113)         
##                                                                                                   
## total.sulfur.dioxide         -0.004***                 -0.002***                 -0.002***        
##                               (0.001)                   (0.001)                   (0.001)         
##                                                                                                   
## residual.sugar                                           0.008                     0.005          
##                                                         (0.014)                   (0.015)         
##                                                                                                   
## alcohol                                                0.301***                  0.313***         
##                                                         (0.019)                   (0.020)         
##                                                                                                   
## chlorides                                                                         -0.722*         
##                                                                                   (0.377)         
##                                                                                                   
## pH                                                                               -0.491***        
##                                                                                   (0.129)         
##                                                                                                   
## Constant                     6.715***                  3.298***                  4.807***         
##                               (0.065)                   (0.227)                   (0.448)         
##                                                                                                   
## --------------------------------------------------------------------------------------------------
## Observations                   1,599                     1,599                     1,599          
## R2                             0.177                     0.323                     0.331          
## Adjusted R2                    0.176                     0.322                     0.328          
## Residual Std. Error      0.733 (df = 1596)         0.665 (df = 1594)         0.662 (df = 1592)    
## F Statistic          171.358*** (df = 2; 1596) 190.403*** (df = 4; 1594) 131.223*** (df = 6; 1592)
## ==================================================================================================
## Note:                                                                  *p<0.1; **p<0.05; ***p<0.01

For red wine, a 1 g/dm^3 increase in volatile acidity results in a 1.245 decrease in quality rating, all else equal. The impact of sulfur dioxide does not seem to change much when controlling for other properties; a 1 mg/dm^3 increase in total sulfur dioxide results in a 0.002 decrease in quality output, holding the other variables constant. A 1 percentage point increase in alcohol by volume results in a 0.313 increase in quality output, ceteris paribus. The coefficient for residual sugar is not significantly different from zero. The coefficient for chlorides is only significant at the 10 percent level. Lastly, a 1 unit increase in pH results in a 0.491 decrease in quality output, holding relevant variables constant. I became worried that including both pH and volatile acidity would result in issues with multicollinearity, but I tested their Pearson correlation, and it was 0.26 (a value of 1 would indicate perfect collinearity).
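The collinearity check described above can be sketched in base R. The data frame below is simulated purely for illustration (it is not the wine data), and its column names only mirror the dataset's; with the real data one would use the red wine data frame instead. The variance inflation factor shown is the simple two-regressor case and is an addition here, not something computed in the original analysis.

```r
# Simulated stand-in for the red wine data (NOT the real measurements).
set.seed(1)
n <- 500
pH <- rnorm(n, mean = 3.3, sd = 0.15)
# Give volatile acidity a mild positive dependence on pH.
volatile.acidity <- 0.5 + 0.3 * (pH - 3.3) + rnorm(n, sd = 0.17)
toy <- data.frame(pH, volatile.acidity)

# Pearson's r between the two regressors; values near +/-1 signal collinearity.
r <- cor(toy$pH, toy$volatile.acidity, method = "pearson")

# For a pair of regressors considered in isolation, the variance
# inflation factor is 1 / (1 - r^2); values close to 1 are unproblematic.
vif <- 1 / (1 - r^2)
print(c(r = r, vif = vif))
```

A correlation of 0.26 corresponds to a pairwise inflation factor of about 1.07, which supports keeping both variables in the model.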

## 
## Table 1.4 (White Wine)
## ==================================================================================================
##                                                   Dependent variable:                             
##                      -----------------------------------------------------------------------------
##                                                         quality                                   
##                                reg 1                     reg 2                     reg 3          
##                                 (1)                       (2)                       (3)           
## --------------------------------------------------------------------------------------------------
## volatile.acidity             -1.587***                 -2.129***                 -2.093***        
##                               (0.135)                   (0.113)                   (0.114)         
##                                                                                                   
## total.sulfur.dioxide         -0.003***                   0.001                    0.0004          
##                              (0.0003)                  (0.0003)                  (0.0003)         
##                                                                                                   
## residual.sugar                                         0.026***                  0.027***         
##                                                         (0.002)                   (0.003)         
##                                                                                                   
## alcohol                                                0.381***                  0.372***         
##                                                         (0.011)                   (0.011)         
##                                                                                                   
## chlorides                                                                         -0.794*         
##                                                                                   (0.461)         
##                                                                                                   
## pH                                                                               0.337***         
##                                                                                   (0.079)         
##                                                                                                   
## Constant                     6.777***                  2.226***                  1.282***         
##                               (0.051)                   (0.142)                   (0.290)         
##                                                                                                   
## --------------------------------------------------------------------------------------------------
## Observations                   4,898                     4,898                     4,898          
## R2                             0.063                     0.259                     0.263          
## Adjusted R2                    0.062                     0.258                     0.262          
## Residual Std. Error      0.858 (df = 4895)         0.763 (df = 4893)         0.761 (df = 4891)    
## F Statistic          164.217*** (df = 2; 4895) 427.607*** (df = 4; 4893) 290.323*** (df = 6; 4891)
## ==================================================================================================
## Note:                                                                  *p<0.1; **p<0.05; ***p<0.01

Unlike red wines, the coefficient for residual sugar is statistically significant at the 1 percent level. Also unlike red wines, when holding other relevant variables constant, the effect of total sulfur dioxide is not significantly different from zero. A 1 g/dm^3 increase in volatile acidity results in a 2.093 decrease in quality output. A 1 g/dm^3 increase in residual sugar results in a 0.027 increase in quality output, all else equal. Lastly, a 1 percentage point increase in alcohol by volume results in a 0.372 increase in quality output, all else equal, which is quite similar to red wine. Note also that the pH coefficient (0.337) is positive for white wine, whereas it is negative for red wine.

I would like to point out that in the two regression tables above, the units and ranges of values vary so greatly that it may be more useful to standardize the coefficients again. This will allow us to look at the impact of a standard deviation change for each variable, rather than a unit change. You can find the outputs of the standardized variables (from the third regression in each table) below.

## [1] "Red wine"
##     volatile.acidity total.sulfur.dioxide       residual.sugar 
##         -0.276019925         -0.085139606          0.008500037 
##              alcohol            chlorides                   pH 
##          0.412444502         -0.042069834         -0.093778891

For red wine, a standard deviation change in alcohol appears to have the highest impact on quality output, roughly one and a half times that of volatile acidity (0.41 versus 0.28 in absolute value), which comes in second place.

## [1] "White wine"
##     volatile.acidity total.sulfur.dioxide       residual.sugar 
##          -0.23825591           0.01977806           0.15550144 
##              alcohol            chlorides                   pH 
##           0.51663189          -0.01957770           0.05745906

For white wine, a standard deviation change in alcohol has an even higher impact than it does for red wine, more than double the impact of volatile acidity, which again comes in second place. Residual sugar appears to have a much larger impact for white wine than it does for red wine when standardizing the coefficients (0.156 versus 0.009).
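For readers who want to reproduce this kind of output, here is a minimal sketch of how standardized coefficients are commonly computed in base R: each slope is multiplied by sd(x) / sd(y). It assumes the values above were produced this way, which the write-up does not state. The data are simulated for illustration only and the object names (toy, fit, std_beta) are hypothetical.

```r
# Simulated data for illustration only (NOT the wine data).
set.seed(2)
n <- 300
toy <- data.frame(alcohol = rnorm(n, 10.5, 1.1),
                  volatile.acidity = rnorm(n, 0.5, 0.18))
toy$quality <- 3 + 0.3 * toy$alcohol - 1.2 * toy$volatile.acidity +
  rnorm(n, sd = 0.7)

fit <- lm(quality ~ alcohol + volatile.acidity, data = toy)

# Standardized coefficient: b_j * sd(x_j) / sd(y), intercept dropped.
b <- coef(fit)[-1]
sx <- sapply(toy[names(b)], sd)
std_beta <- b * sx / sd(toy$quality)

# Equivalent route: refit the same model on z-scored variables.
fit_z <- lm(quality ~ alcohol + volatile.acidity,
            data = as.data.frame(scale(toy)))

print(std_beta)
print(coef(fit_z)[-1])
```

Under this definition a standardized coefficient is expressed in standard deviations of quality per standard deviation of the regressor, so the values are comparable across regressors with different units.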

## [1] "White Wine Prediction"
## 1 
## 6
## [1] "Red Wine Prediction"
## 1 
## 5

When the output is rounded, the prediction is spot on. I created a one-row data frame for red wine and for white wine based on actual observations in the original datasets, and plugged those values into the last regression in each table (1.3 and 1.4). When rounded, the output matched the value under the quality column exactly.
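The prediction step can be sketched as follows. This is an illustrative example on simulated data, not the author's code; the names toy, new_wine, and hit_rate are hypothetical. It also computes the share of rows whose rounded fitted value matches the observed rating, since a single matching row is only anecdotal evidence of accuracy.

```r
# Simulated data for illustration only (NOT the wine data).
set.seed(3)
n <- 300
toy <- data.frame(alcohol = rnorm(n, 10.5, 1.1),
                  volatile.acidity = rnorm(n, 0.5, 0.18))
# Integer quality scores, as in the real datasets.
toy$quality <- round(3 + 0.3 * toy$alcohol - 1.2 * toy$volatile.acidity +
                       rnorm(n, sd = 0.7))

fit <- lm(quality ~ alcohol + volatile.acidity, data = toy)

# Predict for one existing observation and round to the rating scale.
new_wine <- toy[1, c("alcohol", "volatile.acidity")]
pred <- predict(fit, newdata = new_wine)
rounded <- round(pred)

# Share of all rows where the rounded fitted value equals the observed rating.
hit_rate <- mean(round(fitted(fit)) == toy$quality)
print(c(rounded = unname(rounded), actual = toy$quality[1],
        hit_rate = hit_rate))
```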

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

I was surprised that volatile acidity does not appear to have the largest impact on quality output. For red wine, when holding other variables constant, the impact of volatile acidity becomes less negative (the coefficient moves from -1.708 to -1.245). This implies something about the relationship of the other relevant variables to volatile acidity, which we will look at more in the final plots section. For white wine the opposite is true: controlling for other relevant variables makes the coefficient more negative (from -1.587 to -2.093). This is strange, but again implies something tricky going on with omitted variable bias. It may have to do with the fact that residual sugar is statistically significant for white wine but not for red wine.
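The omitted variable bias mentioned above can be illustrated with a small simulation. This is a sketch under stated assumptions: x1 and x2 are hypothetical stand-ins (say, volatile acidity and one omitted regressor), and the numbers are invented for the example rather than taken from the wine data. It shows that the coefficient from the short regression equals the long-regression coefficient plus the omitted variable's coefficient times the slope of the omitted variable on the included one.

```r
# Simulated data for illustration only (NOT the wine data).
set.seed(4)
n <- 1000
x1 <- rnorm(n)                # stand-in for the included regressor
x2 <- -0.5 * x1 + rnorm(n)    # stand-in for a correlated, omitted regressor
y  <- -1.2 * x1 + 0.3 * x2 + rnorm(n)

short <- coef(lm(y ~ x1))["x1"]      # x2 omitted
long  <- coef(lm(y ~ x1 + x2))       # x2 controlled for
delta <- coef(lm(x2 ~ x1))["x1"]     # how x2 moves with x1

# OLS identity: short slope = long slope + (coef on x2) * delta
implied <- long["x1"] + long["x2"] * delta
print(c(short = unname(short), implied = unname(implied),
        long = unname(long["x1"])))
```

Whether the coefficient becomes more or less negative after adding controls therefore depends on the sign of the omitted variable's effect and on how it is correlated with volatile acidity, which can differ between red and white wine.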

Were there any interesting or surprising interactions between features?

Yes, when standardizing the coefficients, alcohol appears to have the greatest impact on quality output, with a one standard deviation change associated with a 0.52 standard deviation increase in quality for white wines and a 0.41 standard deviation increase for red wines, all else equal. I certainly expected alcohol to have an impact, but I thought that the effect of pH and volatile acidity would be stronger. Volatile acidity comes in second place, which seems to confirm the idea that too much volatile acidity can result in a nasty vinegar taste. It is probably true that volatile acidity in lower levels is more or less undetectable, but in high doses can result in poor-tasting wine.

OPTIONAL: Did you create any models with your dataset? Discuss the
strengths and limitations of your model.

Yes, I utilized a linear model with quality as the output variable and six regressors. I want to be clear that none of the results should be seen as facts. I am not a PhD in statistics; I do not even have a master's degree. I am familiar with regression models, but my knowledge is limited to the undergraduate level. One limitation is that the usual standard errors in linear regression assume that the errors are homoskedastic (constant variance), when in reality they are almost always heteroskedastic. I attempted to correct for this by calculating robust standard errors, but there are other techniques I could have utilized (such as using a log model). The linear model also assumes that predictor variables are free of measurement errors, and that assumption is also not very realistic in practice. Also, I left out a few predictor variables from the datasets that may or may not have been relevant to quality output, and some that I thought may have resulted in issues with multicollinearity (e.g. including fixed acidity and volatile acidity, or citric acid and volatile acidity). It may have been useful to include more variables and experiment with more robust regressions, but the purpose of this project was not solely to provide a strong mathematical model. The purpose was also to provide strong visuals to explore certain anomalies in the data, and to be curious about the data. Another limitation is that the units of the variables vary widely. I attempted to correct for this issue by standardizing the coefficients, but there are fundamental limitations to that technique as well. For example, a standard deviation change in alcohol content may be difficult to interpret comparatively. Lastly, the quality variable is based on sensory data, which is very subjective.
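As an illustration of the robust standard errors mentioned above, here is a minimal base R sketch of the heteroskedasticity-consistent "sandwich" estimator with the HC1 small-sample correction. Which variant was actually used in this analysis is not stated, so HC1 is an assumption, and the data are simulated rather than taken from the wine datasets.

```r
# Simulated data with error spread that grows with x (NOT the wine data).
set.seed(5)
n <- 400
x <- runif(n, 8, 14)
y <- 2 + 0.3 * x + rnorm(n, sd = 0.1 * x)
fit <- lm(y ~ x)

X <- model.matrix(fit)            # n x k design matrix
e <- residuals(fit)               # OLS residuals
k <- ncol(X)

XtX_inv <- solve(crossprod(X))    # (X'X)^-1, the "bread"
meat <- crossprod(X * e)          # X' diag(e^2) X, the "meat"

# HC1: sandwich estimator scaled by n / (n - k).
vcov_hc1 <- (n / (n - k)) * XtX_inv %*% meat %*% XtX_inv

robust_se <- sqrt(diag(vcov_hc1))
classical_se <- sqrt(diag(vcov(fit)))
print(rbind(classical_se, robust_se))
```

Robust standard errors change the inference (t statistics and p-values) but leave the coefficient estimates themselves untouched.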


Final Plots and Summary

Plot One

Description One

This plot displays the quality distribution of the white wine dataset, with the Riesling sugar guidelines factored in. Wines rated as ‘Dry’ have the highest frequency for each quality level. We explored the relationship between the Riesling scale and residual sugar earlier, but I wonder how that relationship relates to quality output.

Plot Two

Description Two

Wines with relatively low alcohol content have the highest amount of residual sugar at every quality level except for wines rated with a quality of 3. Again, this is likely because less sugar is converted to alcohol during the fermentation process. Wines with a moderate alcohol content spike above average at a quality of 6 to 8. Wines with a relatively high alcohol content remain below the average amount of residual sugar across all quality levels. Wines with above average alcohol content (but not high) remain below the average amount of residual sugar from qualities of 4 to 8, but spike far above average for wines rated with a quality of 9. This is an interesting anomaly.

Plot Three

Description Three

Again we explore the relationship between sulphates and total sulfur dioxide, this time coloring by quality rating. There seems to be a lot of variance in the quality ratings. There are many higher-quality points with a relatively low amount of sulphates and a relatively high amount of sulfur dioxide, but there are also a lot of higher-quality points with a relatively low level of sulfur dioxide and a higher amount of sulphates. Perhaps in good wine, sulphates and sulfur dioxide are substitutes (you would not want high levels of both, but you would want one or the other).


Reflection

Exploring the relationships between quality ratings and the physicochemical properties of wine was full of surprises and challenges. At first glance, it appeared that volatile acidity had a much larger impact than it actually does. The issue was that volatile acidity was measured in g/dm^3, but only 2% of the wines observed had volatile acidity above 0.78 g/dm^3, so looking at a one unit increase for volatile acidity was not useful.

Another challenge was figuring out which variable relationships to explore. In the variable description text file provided with the datasets, the authors explicitly state, “we are not sure if all input variables are relevant.” The unknown relevance of variables in a dataset is very common in the real world, so it was interesting to see if I could build a model that would predict wine quality somewhat accurately.

One aspect that I think went well was the quality predictions based on the linear regressions for white wine and red wine respectively. When rounding the output, the results were accurate. Also, I am glad that I ended up standardizing the coefficients, because I was finding every excuse not to originally.

I have several ideas for further study. First, I noticed that the authors of the original study used a support vector machine regression technique, and I thought about giving that a try here, but I did not want to get in over my head. However, I also noticed that we study SVM regression later in this course, so it will be interesting to revisit this data once I learn more. Also, it would be interesting to include a log model and to include more variables. With a log-transformed response, a unit change in x is interpreted as an approximate percent change in y rather than a unit change in y.
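To make the log-model interpretation concrete, here is a small hedged sketch on simulated data (not the wine data; the variable names are hypothetical). With log(y) as the response, a slope b translates into roughly a 100 * b percent change in y per unit of x, and exactly 100 * (exp(b) - 1) percent.

```r
# Simulated data for illustration only (NOT the wine data).
set.seed(6)
n <- 500
x <- runif(n, 8, 14)
y <- exp(1 + 0.05 * x + rnorm(n, sd = 0.1))   # strictly positive response

fit_log <- lm(log(y) ~ x)
b <- unname(coef(fit_log)["x"])

pct_approx <- 100 * b               # small-coefficient approximation
pct_exact  <- 100 * (exp(b) - 1)    # exact percent change in y per unit of x
print(c(slope = b, pct_approx = pct_approx, pct_exact = pct_exact))
```

The approximation is close when the slope is small and drifts away from the exact value as the slope grows.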